PREFACE

Inventory,  maintenance,  and  queuing  models  are  the 
lifeblood  of  Air  Force  logistics  research.  Markov  renewal 
programs  are  imbedded  in  most  of  those  that  are  amenable 
to  analytic  treatment.  When  they  are  explicitly  recognized, 
the  analysis  is  often  streamlined.  Contrary  to  usual 
practice,  we  do  not  make  the  (generally  unrealistic) 
assumption  that  all  parameters  of  the  model  are  known. 
However, information about them is acquired sequentially
by  observing  the  reward  stream,  transition  times,  and 
successive  states  visited,  which  depend  on  the  policy 
employed. 

For maximizing long-run average reward, we find a
history-remembering, adaptive policy that does as well
as we could if we knew all the parameters.


SUMMARY 


We recast a class of infinite-state, infinite-action
Markov renewal programs with unknown parameters as one-state
programs with actions corresponding to stationary policies
in the original program. Under suitable conditions we find
an adaptive (nonstationary) optimal policy in the sense of
maximizing long-run expected reward per unit time.


ADAPTIVE POLICIES FOR MARKOV RENEWAL PROGRAMS

1. INTRODUCTION

Finite-state, finite-action Markov renewal programs
with all parameters known were defined by Jewell in [9].
We study the infinite-state, infinite-action analog with
the parameters unknown. One-step reward distributions,
transition time distributions, and transition probabilities
are unknown a priori. Beginning in state $i \in S$, the
decision-maker takes action $k \in A_i$, moves to state j with
probability $p_{ij}^k$, and given that he moves to j, receives
reward $R_{ij}^k$ during a transition lasting $T_{ij}^k$, after which
he takes another action, has a transition, etc. His objective
is to find a policy which maximizes his expected long-run
average reward. A policy $(\delta_1, \delta_2, \ldots)$ is a collection of
functions mapping states into actions. At the n-th decision
(transition), $\delta_n$ maps each state i into $A_i$, where $\delta_n$ may
depend on the history of the process prior to n. A stationary
policy $\delta$ is of the form $(\delta, \delta, \ldots)$; it uses the same
function for each decision and thus cannot be
history-remembering. Define $\Delta$ to be the set of all stationary
policies. Making certain assumptions, we construct a
nonstationary adaptive policy which does as well as any
stationary policy in maximizing expected reward per unit time
no matter what values the unknown parameters have.

Thus, using the average reward rate criterion, our policy
is optimal whenever a stationary optimal or stationary
$\varepsilon$-optimal policy exists. Lippman [10] shows the existence
of a stationary $\varepsilon$-optimal policy under essentially our
assumptions, although in general a stationary optimal policy
need not exist. Similar results are unattainable in the
general multichain case because there is no way to be sure
of optimizing the action in a transient state when the action
determines which absorbing chain will be entered. A stationary
optimal policy exists in the finite-state, finite-action case
(Fox [5]), and sufficient conditions for existence have been
given (Fox [6]) in the finite-state, infinite-action case.

Mallows and Robbins [12] give results analogous to
ours in the discrete-time, one-state case. Part of our
argument is an adaptation of theirs. Banos [1] and Shubert
[16] treat similar problems from game theoretic and
statistical decision theoretic viewpoints, respectively.


2.  THE  MAXIMIZING  POLICY 


Suppose a stationary policy $\delta \in \Delta$ is always used.
For any fixed path let

    $R^\delta(t)$ = total reward received up to time t.

The strong law of large numbers together with a standard
renewal theory argument implies that

    $\lim_{t \to \infty} t^{-1} R^\delta(t)$

exists and equals a constant with probability 1. Thus the
expected gain rate associated with $\delta$ is defined as

    $g^\delta = \lim_{t \to \infty} t^{-1} E[R^\delta(t)]$.

Let $g^* = \sup_{\delta \in \Delta} g^\delta$. Using the adaptive policy for any
fixed path, let

    $R(t)$ = total reward received up to time t.

Under the assumptions given below we show that the rewards
from the adaptive policy satisfy

(1)  $P\{\lim_{t \to \infty} t^{-1} R(t) = g^*\} = 1$.

It will then follow that

(2)  $\lim_{t \to \infty} t^{-1} E[R(t)] = g^*$.

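As a small illustration of the renewal argument, the gain rate of a single stationary policy can be estimated as accumulated reward divided by accumulated time over completed cycles. The sketch below assumes a hypothetical one-state policy whose cycle reward and cycle length come from an illustrative sampler `sample_cycle`; the sampler and its distributions are not from the paper and are chosen so that the ratio of expected reward to expected time is 2.

```python
import random


def sample_cycle(rng):
    """Hypothetical cycle: length V ~ Uniform(1, 3), reward U = 2*V + noise."""
    v = rng.uniform(1.0, 3.0)
    u = 2.0 * v + rng.gauss(0.0, 0.5)
    return u, v


def estimate_gain_rate(n_cycles, seed=0):
    """Accumulated reward divided by accumulated time over n_cycles cycles."""
    rng = random.Random(seed)
    total_reward = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        u, v = sample_cycle(rng)
        total_reward += u
        total_time += v
    return total_reward / total_time


if __name__ == "__main__":
    # With many cycles the estimate should be close to E[U] / E[V] = 2.
    print(estimate_gain_rate(100_000))
```

The estimate is a ratio of sums, not an average of per-cycle ratios, which is what the renewal argument calls for.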
The  assumptions: 

1. There is an a priori known countable set of stationary
policies $\Lambda \subset \Delta$ such that

    $\sup\{g^\lambda : \lambda \in \Lambda\} = g^*$.

2. For each state, the expectation and the variance of
the time and reward until state 1 is next reached are
uniformly bounded over $\Lambda$.

3. For each state, the expected time to return to state 1
is uniformly bounded away from 0 over $\Lambda$.

Note that uniform bounds over $S \times \Lambda$ are not needed.
Our proof uses assumptions 1, 2, and 3 directly. In [15]
it is shown that assumptions 2 and 3 on $\Delta$ rather than $\Lambda$
imply assumption 1, thus effectively eliminating the need
for assumption 1.

Alternatively, conditions on the transition matrices,
one-step reward distributions, and one-step time
distributions can be given which imply assumptions 2 and 3.
They are:

2'. The means and variances of the one-step times
and rewards are uniformly bounded from above over actions
and states.

3'. The mean of the one-step times is uniformly
bounded away from zero over states and actions.

4'. The semi-Markov process associated with any
stationary policy is regular and the mean and variance of
the time to get to state 1 are uniformly bounded over $S \times \Delta$.

To ensure that our adaptive policy will concentrate
with probability 1 on the high gain rate stationary policies,
only policies in $\Lambda$ are tried and each policy in $\Lambda$ is tried
infinitely often. The basic unit used in defining our
policy is the state 1-to-state 1 cycle. Within each
state 1-to-state 1 cycle the same stationary policy is
used.

Beginning in the initial state some fixed stationary
policy is applied until state 1 is reached. From this
point forward we have a one-state problem since policy
choices are only made at state 1. Following Mallows and
Robbins [12] our strategy specifies a sequence of positive
integers which number the forced-choice cycles in which
predetermined stationary policies are applied. Let
$s_{11}, s_{12}, \ldots$ be an increasing sequence of positive integers
with $s_{11} = 1$. Let $s_{21}, s_{22}, \ldots$ be a second disjoint sequence
with $s_{31}, s_{32}, \ldots$ being a third sequence disjoint from the
first two, and so on. If for the n-th cycle $n = s_{\delta\ell}$ for
some $\delta$, $\ell$, we use stationary policy $\delta$ for the n-th cycle;
otherwise we choose the stationary policy with the leading
observed reward rate. The observed reward rate for policy
$\delta$ at time t is given by

    $R^\delta(t) / B^\delta(t)$,

where $R^\delta(t)$ is the total reward received and $B^\delta(t)$ is the
total time spent prior to t while $\delta$ was applied. The
ratio is defined only for those $\delta$ which have been
applied in a forced-choice cycle prior to t. Let s(n) be
the total number of forced choices up to the n-th cycle,
i.e., the number of integers $s_{\delta\ell}$ which are $\le n$. We choose
the $s_{\delta\ell}$ so that

    [condition not legible in the source]

A choice which satisfies this is $s(n) \sim \log n$.

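The forced-choice and follow-the-leader mechanics described above can be sketched in a few lines for a one-state problem. Everything specific here is an assumption made for illustration: the policy set is truncated to a finite list, the forced cycles are taken to be n = 1, 2, 4, 8, ... (so that s(n) grows like log n) and are dealt out round-robin, and the hypothetical helper `make_sampler` supplies cycle rewards and lengths. It is not the authors' construction of the sequences.

```python
import random


def forced_policy(n, num_policies):
    """Return the policy forced on cycle n, or None if cycle n is a free choice.

    Assumed schedule: forced cycles are n = 1, 2, 4, 8, ...; the k-th of them
    goes to policy k mod num_policies, so s(n) grows like log2(n).
    """
    if (n & (n - 1)) == 0:  # n is a power of two (n >= 1)
        return (n.bit_length() - 1) % num_policies
    return None


def make_sampler(mean_reward, mean_time):
    """Hypothetical cycle sampler with gain rate mean_reward / mean_time."""
    def sampler(rng):
        v = rng.uniform(0.9 * mean_time, 1.1 * mean_time)
        u = rng.gauss(mean_reward, 0.1)
        return u, v
    return sampler


def run_adaptive(cycle_samplers, num_cycles, seed=0):
    """Follow the leader with forced choices; returns overall reward / time."""
    rng = random.Random(seed)
    k = len(cycle_samplers)
    reward = [0.0] * k  # reward earned while each policy was applied
    time = [0.0] * k    # time spent while each policy was applied
    for n in range(1, num_cycles + 1):
        choice = forced_policy(n, k)
        if choice is None:
            # Free choice: leader among policies already tried at least once.
            tried = [d for d in range(k) if time[d] > 0.0]
            choice = max(tried, key=lambda d: reward[d] / time[d])
        u, v = cycle_samplers[choice](rng)
        reward[choice] += u
        time[choice] += v
    return sum(reward) / sum(time)


if __name__ == "__main__":
    # Gain rates 0.5, 1.5 and 2.0; the adaptive rate should approach 2.0.
    samplers = [make_sampler(1.0, 2.0), make_sampler(3.0, 2.0),
                make_sampler(2.0, 1.0)]
    print(run_adaptive(samplers, 50_000))
```

Because cycle 1 is always forced, the set of tried policies is never empty at a free choice, which mirrors the requirement that the first forced-choice index equal 1.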
In  sampling  policies  directly  rather  than  actions, 
we  do  not  fully  utilize  all  information  since  each  action 
is  associated  with  many  different  policies.  We  do  not 
know  a  general  remedy,  but  in  Section  4  we  give  a  modified 
policy  for  the  finite  case  that  uses  information  more 
efficiently.


3. PROOF OF OPTIMALITY

Let the random variables $U^\delta$ and $V^\delta$ be, respectively,
reward and time for a state 1-to-state 1 cycle using $\delta$. In
view of assumption 2, $U^\delta$ and $V^\delta$ have expectations and
variances uniformly bounded over $\Lambda$. Call this bound H.
The policy we use essentially reduces the original problem
to a one-state problem, where the transition times need
not be constants. As in [12], we can define a nondecreasing
sequence $\{c_n\}$ such that

(3)  $c_n > 0$,  $\sum_n c_n^{-2} < \infty$,  $n^{-1} s(n) c_n \to 0$.

A particular choice which can be shown to satisfy (3) is
[expression not legible in the source]. Denote the distribution
of $U^\delta$ by $F^\delta$. By the Markov inequality

    $F^\delta(-c_n) \le H c_n^{-2}$,  $\delta \in \Lambda$.

Hence, if $D_n$ is the policy used on the n-th cycle and $U_n$
is the corresponding reward, then

    $P\{U_n < -c_n\} = \sum_\delta P\{D_n = \delta\} F^\delta(-c_n) \le H c_n^{-2}$

and so $\sum_n P\{U_n < -c_n\}$ converges by (3). By the Borel-Cantelli
lemma, $U_n < -c_n$ only finitely often w.p.1; so w.p.1 there
is an $N_1$ such that $U_n \ge -c_n$ for all $n \ge N_1$. A similar
argument shows that w.p.1 there is an $N_2$ such that
$|U_n| \le c_n$ and $V_n \le c_n$ for all $n \ge N_2$. Hence, by (3), we
can neglect the contribution of the forced-choice cycles to
the overall reward and time.

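To make condition (3) concrete, here is one pair of sequences that can be checked by hand. The choice $c_n = n^{3/4}$ together with $s(n) = \log n$ is an illustrative assumption and is not claimed to be the particular choice the authors give.

```latex
% Illustrative choice (an assumption, not taken from the paper):
%   s(n) = \log n,   c_n = n^{3/4}.
\begin{align*}
  c_n &= n^{3/4} > 0 \quad\text{and nondecreasing in } n, \\
  \sum_{n \ge 1} c_n^{-2} &= \sum_{n \ge 1} n^{-3/2} < \infty
      \quad\text{(a $p$-series with } p = 3/2 > 1\text{)}, \\
  n^{-1}\, s(n)\, c_n &= n^{-1/4} \log n \longrightarrow 0
      \quad\text{as } n \to \infty .
\end{align*}
```

The two requirements pull in opposite directions: summability of $c_n^{-2}$ forces $c_n$ to grow faster than $n^{1/2}$, while the last condition caps its growth at roughly $n / s(n)$, so slowly growing $s(n)$ leaves room for both.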
At time t, define $f_1(t)$ as the number of forced choices
prior to time t and $f_2(t)$ as the number of different policies
used on free choices prior to t. Let p(t) = the number of
different policies used prior to time t. Then

    $f_1(t) \ge p(t) \ge f_2(t)$.

Define the last free-choice cycle prior to t for a policy
$\delta$ as the last cycle, if any, for which policy $\delta$ was chosen
as the leader. By the above argument the contribution of
the last free-choice cycles can be neglected. Assumption 2
implies that the time and rewards before reaching
state 1 for the first time can also be neglected.

Indexing consecutive cycles by m and excluding the
time and rewards before reaching state 1 for the first
time we define

    $V_i$ = time to complete the i-th cycle,

    $V_i^\delta$ = time to complete the i-th cycle if policy $\delta$ is used,
           0 otherwise,

    $B(m) = \sum_{i=1}^{m} V_i$,

    $B^\delta(m) = \sum_{i=1}^{m} V_i^\delta$.

$U_i$, $U_i^\delta$, $R(m)$, $R^\delta(m)$ are defined similarly for the reward
sequence. Let

    $N^\delta(m)$ = the number of the first m cycles which
               use policy $\delta$.

Fix a policy $\sigma \in \Lambda$;

    $r^\sigma(m) = R^\sigma(m) / B^\sigma(m) = [R^\sigma(m)/N^\sigma(m)] / [B^\sigma(m)/N^\sigma(m)]$.

By the SLLN the numerator and denominator each converge to a
constant w.p.1 as $N^\sigma(m) \to \infty$. Since the union of two null
sets is null and $N^\sigma(m) \to \infty$ as $m \to \infty$, the advanced calculus
result that the limit of a ratio is the ratio of the limits
yields

(4)  $\lim_{m \to \infty} r^\sigma(m) = g^\sigma$ w.p.1.

This is a special case of a result of Pyke and Schaufele
[14, Theorem 5.1]. Strictly speaking, the fact that the
constant is $g^\sigma$ depends on the lemma proved later that
relates cycles and continuous time. Using the fact that
policy $\sigma$ is applied infinitely often (in the forced-choice
cycles $s_{\sigma 1}, s_{\sigma 2}, \ldots$), we can choose m large enough to
guarantee in advance that $N^\sigma(m) \ge N$ for any fixed N. Thus
fixing $\varepsilon_1, \varepsilon_2 > 0$, choose $m_0$ such that

    $P\{r^\sigma(m) \ge g^\sigma - \varepsilon_2$ for all $m \ge m_0\} \ge 1 - \varepsilon_1$,

from the above SLLN argument.

Let $\Gamma$ be the set of policies used in free choices
for $m \ge m_0$. For $\gamma \in \Gamma$ and $m \ge m_0$, define $\ell^\gamma(m)$ as the
largest cycle index $\le m$ where $\gamma$ was freely chosen. From
the definitions,

    $r^\gamma(\ell^\gamma(m) - 1) \ge r^\sigma(\ell^\gamma(m) - 1)$,

for all $\sigma \in \Lambda$.

By earlier arguments and neglecting the contributions
from policies not used after cycle $m_0$, the overall observed
reward rate $r(m) = R(m)/B(m)$ satisfies

    $r(m) = \sum_{\gamma \in \Gamma} R^\gamma(\ell^\gamma(m) - 1)/B(m) + o_p(1)$,

where the last term goes to 0 w.p.1 as $m \to \infty$. Thus,

    $r(m) \ge \sum_{\gamma \in \Gamma} r^\sigma(\ell^\gamma(m) - 1) B^\gamma(\ell^\gamma(m) - 1)/B(m) + o_p(1)$

         $\ge (g^\sigma - \varepsilon_2) [\sum_{\gamma \in \Gamma} B^\gamma(\ell^\gamma(m) - 1)/B(m)] + o_p(1)$

         $\ge g^\sigma - \varepsilon_2 + o_p(1)$,

since the term in square brackets goes to 1 w.p.1 as
$m \to \infty$, and as $\varepsilon_2$ was arbitrary,

    $\liminf_{m \to \infty} r(m) \ge g^\sigma$ w.p.1.

Since $\sigma$ was arbitrary and a denumerable union of null sets
is null,

(5)  $P(\liminf_{m \to \infty} r(m) \ge g^*) = 1$.

Now $r^\delta(m) \to g^\delta$ w.p.1 as $m \to \infty$ so that

(6)  $P(\limsup_{m \to \infty} r(m) \le g^*) = 1$,

since $r(m)$ is a weighted average of the $r^\delta(m)$. Thus we
have proved

(7)  $P(\lim_{m \to \infty} r(m) = g^*) = 1$.

To complete the proof of (1) we need a lemma to allow
us to move from indexing by number of transitions to
indexing by time.

LEMMA.

(a)  $\limsup_{m \to \infty} B(m)/m < \infty$ w.p.1.

(b)  $\liminf_{m \to \infty} B(m)/m > 0$ w.p.1.

Proof. For the proof of the lemma a more abstract
representation of the process is required. The process
can be viewed as a sequence of functions on an underlying
measure space. See [11] or [15]. Define the $\sigma$-field

    $\mathcal{F}_j = \sigma\{U_i, V_i, D_i;\ i = 1, 2, \ldots, j\}$,

where $\sigma\{\cdot\}$ means the smallest $\sigma$-field over which $\{\cdot\}$ is
measurable. Thus $E(V_i \mid \mathcal{F}_{i-1})$ is the expected time to
complete the i-th cycle given the previous history of the
process while $E(V_i \mid D_i)$ is the expected time to complete
the i-th cycle given the value of $D_i$, the stationary policy
used. Note that $D_j$ is measurable over $\mathcal{F}_{j-1}$ and

    $E(V_j \mid \mathcal{F}_{j-1}) = E(V_j \mid D_j)$.

Clearly,

    $B(m) = \sum_{i=1}^{m} [V_i - E(V_i \mid D_i)] + \sum_{i=1}^{m} E(V_i \mid D_i)$.

Thus, since $E(V_i \mid D_i)$ is bounded away from 0 and $\infty$ by
assumptions 2 and 3, it suffices to show that the first
term is negligible. Now

    $E[(V_i - E(V_i \mid D_i)) \mid D_i] = 0$

and

    $\sum_{i=1}^{\infty} E[(V_i - E(V_i \mid D_i))^2]/i^2 \le H \sum_{i=1}^{\infty} 1/i^2 < \infty$.

By a standard martingale theorem (Feller [4, pp. 234-238]),

    $\lim_{m \to \infty} m^{-1} \sum_{i=1}^{m} (V_i - E(V_i \mid D_i)) = 0$ w.p.1.

Turning to the proof of (2), we let P be a measure on the
reward sequence corresponding to our policy and write
$r(t) = t^{-1} R(t)$. For any a > 0,

    $\int_{|r(t)| > a} |r(t)| \, dP \le a \int_{|r(t)| > a} [r(t)/a]^2 \, dP \le H/a \to 0$ as $a \to \infty$.

Thus, the random variables $\{r(t)\}$ are uniformly integrable
so (1) implies (2). (Loeve [11], p. 163.) We have proved
the following.

THEOREM. Under assumptions 1, 2, and 3, the average
rewards from the above strategy satisfy

(i)  $P\{\lim_{t \to \infty} t^{-1} R(t) = g^*\} = 1$,

(ii)  $\lim_{t \to \infty} t^{-1} E[R(t)] = g^*$.


4. REMARKS

In the finite-state, finite-action case we could
sample actions on each transition rather than policies
on each cycle. By sampling actions on forced-choice
transitions we can obtain consistent estimates of the
parameters. One choice is the natural empirical estimators,
which are consistent (Moore and Pyke [13]). On the
free-choice transitions, we follow the leader obtained by
substituting these estimates for the unknown parameters
in the gain rate formula for stationary policies. The
estimated optimal policy can be calculated using a policy
improvement routine or linear program (Fox [5], Denardo
and Fox [2]). The proof that $g^*$ is attained is different
from the above one since we do not reduce the problem to
one state, but the number of policies being finite leads
to some simplifications. In the finite case, the existence
of a stationary optimal policy makes our policy optimal.
Intuition indicates a faster convergence rate for sampling
actions directly rather than just sampling policies.

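For the finite case, the "follow the leader" step can be illustrated by plugging parameter estimates into the standard gain rate formula for a stationary policy with a single recurrent class: the ratio of stationary-weighted expected one-step rewards to stationary-weighted expected one-step times, with weights from the embedded chain. The sketch below is a minimal illustration under stated assumptions: the arrays `p`, `r`, and `tau` are hypothetical estimates, and brute-force enumeration of policies stands in for the policy improvement routine or linear program mentioned above.

```python
from itertools import product


def stationary_distribution(P, iters=2000):
    """Stationary distribution of the embedded chain by averaged power iteration.

    Assumes a single recurrent class. Averaging each step with the previous
    iterate keeps the iteration convergent for periodic chains as well.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [0.5 * (a + b) for a, b in zip(pi, nxt)]
    return pi


def gain_rate(policy, p, r, tau):
    """Gain rate of a stationary policy: sum(pi * r) / sum(pi * tau)."""
    n = len(policy)
    P = [p[i][policy[i]] for i in range(n)]
    pi = stationary_distribution(P)
    reward = sum(pi[i] * r[i][policy[i]] for i in range(n))
    time = sum(pi[i] * tau[i][policy[i]] for i in range(n))
    return reward / time


def estimated_leader(p, r, tau):
    """Enumerate all stationary policies; return (best policy, its gain rate)."""
    action_sets = [range(len(p[i])) for i in range(len(p))]
    best = max(product(*action_sets), key=lambda pol: gain_rate(pol, p, r, tau))
    return best, gain_rate(best, p, r, tau)


# Hypothetical estimates for 2 states with 2 actions each:
# p[i][k] = transition row, r[i][k] = mean reward, tau[i][k] = mean time.
p = [[[0.5, 0.5], [0.9, 0.1]],
     [[1.0, 0.0], [0.5, 0.5]]]
r = [[1.0, 2.0], [0.0, 3.0]]
tau = [[1.0, 2.0], [1.0, 2.0]]

if __name__ == "__main__":
    print(estimated_leader(p, r, tau))
```

Enumeration is only workable for very small problems; it is used here because it makes the plug-in idea visible without extra machinery.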
Many problems studied in the literature satisfy our
assumptions; for example,

(i) replacement problems where we return to state 1
(replace the item) whenever the state (or deterioration)
exceeds a certain level, to be determined.

(ii) queuing problems where we activate the server whenever
the queue length exceeds a certain level, to be determined.
Heyman [8] gives conditions under which a policy
of this form is optimal for the M/G/1 queue. To satisfy
our assumptions, we rule out policies that do not
activate the server when the queue length exceeds a
given (large) number.

(iii) inventory problems where we determine a reorder point
and a reorder level. See Hadley and Whitin [7, Chap. 8].

Lippman [10] mentions another example: the "streetwalker's
dilemma," where the server must decide whether to accept a
given proposition or wait for a more desirable one. He
gives simple conditions under which our assumptions hold
and the optimal policy has the form: accept an offer if
and only if the ratio of expected reward to expected service
time exceeds a certain number.